This paper addresses the estimation of the nonparametric conditional
moment restricted model that involves an infinite-dimensional
parameter $g_0$. We estimate it in a quasi-Bayesian way, based on the
limited information likelihood, and investigate the impact of three
types of priors on the posterior consistency: (i) truncated prior
(priors supported on a bounded set), (ii) thin-tail prior (a prior that
has very thin tail outside a growing bounded set) and (iii) normal prior
with nonshrinking variance. In addition, $g_0$ is allowed to be only
partially identified in the frequentist sense, and the parameter space
does not need to be compact. The posterior is regularized using a slowly
growing sieve dimension, and it is shown that the posterior converges
to any small neighborhood of the identified region. We then apply our
results to the nonparametric instrumental regression model. Finally,
the posterior consistency using a random sieve dimension parameter
is studied.

1. Introduction. We consider a conditional moment restricted model

(1.1)  $E(\rho(Z, g_0) \mid W, g_0) = 0,$

where $(Z^T, W^T)$ is a vector of observable random variables, and $W$
may or may not be included in $Z$. Here $\rho$ is a one-dimensional
residual function known up to $g_0$. The conditional expectation is taken
with respect to the conditional distribution of $Z$ given $W$ and $g_0$,
assumed unknown. The parameter of interest is $g_0$, which is infinite
dimensional. Moreover, suppose we observe independent and identically
distributed data $\{(Z_i^T, W_i^T)\}_{i=1}^n$ of $(Z^T, W^T)$.

Model (1.1) is a very general setting, which encompasses many important
classes of nonparametric and semiparametric models.

Example 1.1 (Regular nonparametric regression). Consider the model

$Y = g_0(W) + \varepsilon,$

assuming $E(\varepsilon \mid W) = 0$. Let $Z = (Y, W)$; then it can be
written as the conditional moment restricted model with
$\rho(Z, g_0) = Y - g_0(W)$.

Example 1.2 (Single index model). Consider the single index model

$Y = h_0(W^T \theta_0) + \varepsilon,$

where $E(\varepsilon \mid W) = 0$. The parameter of interest is
$(h_0, \theta_0)$, with $h_0$ being nonparametric. This type of model is
studied by Ichimura (1993) and Antoniadis, Gregoire and McKeague (2004).
By defining $Z = (Y, W)$, $g_0 = (h_0, \theta_0)$ and
$\rho(Z, g_0) = Y - h_0(W^T \theta_0)$, we can write
$E(\rho(Z, g_0) \mid W, g_0) = 0$.

Example 1.3 (Nonparametric IV regression). Consider the nonparametric
model

$Y = g_0(X) + \varepsilon,$

where $X$ is an endogenous regressor, meaning that $E(\varepsilon \mid X)$
does not vanish. However, suppose we have observed an instrumental
variable $W$ for which $E(\varepsilon \mid W) = 0$; then it becomes
a nonparametric regression model with instrumental variables (NPIV),
studied by Newey and Powell (2003) and Hall and Horowitz (2005). Define
$\rho(Z, g_0) = Y - g_0(X)$, with $Z = (Y, X)$. Then we have the
conditional moment restriction.

Example 1.4 (Nonparametric quantile IV regression). The nonparametric
quantile IV regression was previously studied by Chernozhukov and
Hansen (2005), Chernozhukov, Imbens and Newey (2007) and Horowitz and
Lee (2007). The model is

$y = g_0(X) + \varepsilon, \qquad P(\varepsilon < 0 \mid W) = \gamma,$

where $g_0$ is the unknown function of interest, and $\gamma \in (0, 1)$
is known and fixed. Assume $X$ is a continuous random variable. Then the
conditional moment restriction is given by

$E(\rho(Z, g_0) \mid W, g_0) = 0, \qquad \rho(Z, g_0) = I(y < g_0(X)) - \gamma.$
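
As an illustration of how Examples 1.1-1.4 share one template, the
residual functions $\rho$ can be sketched as below. This is only a
sketch: the function names and the convention of passing $g$ as a Python
callable are assumptions made for illustration and are not part of the
paper.

```python
def rho_regression(y, w, g):
    """Example 1.1: rho(Z, g) = Y - g(W), with Z = (Y, W)."""
    return y - g(w)


def rho_single_index(y, w, h, theta):
    """Example 1.2: rho(Z, g) = Y - h(W^T theta), with g = (h, theta)."""
    index = sum(w_l * t_l for w_l, t_l in zip(w, theta))
    return y - h(index)


def rho_npiv(y, x, g):
    """Example 1.3: rho(Z, g) = Y - g(X), with Z = (Y, X)."""
    return y - g(x)


def rho_quantile_iv(y, x, g, gamma):
    """Example 1.4: rho(Z, g) = I(y < g(X)) - gamma."""
    indicator = 1.0 if y < g(x) else 0.0
    return indicator - gamma
```

In each case the model states that the conditional mean of the returned
value given $W$ is zero when $g = g_0$.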

If we define $G(g) = E_W\{[E(\rho(Z, g) \mid W, g_0)]^2\}$, an equivalent
way of writing model (1.1) is then $G(g_0) = 0$. When the unknown function
$g_0$ depends on a certain endogenous variable as in Examples 1.3 and 1.4,
the identification and consistent estimation of $g_0$ is challenging. On
one hand, there can be multiple functions in the parameter space that
satisfy the moment restriction (1.1). On the other hand, even if $g_0$ is
identified [in which case the functional $G(g)$ is uniquely minimized at
$g = g_0$, as is typically assumed in the literature], reducing $G(g)$
toward $G(g_0)$ does not guarantee that $\|g - g_0\|_s$ will also be close
to zero, for a certain norm $\|\cdot\|_s$ of interest. Therefore,
minimizing a consistent estimator of $G(g)$ does not lead to a consistent
estimator of $g_0$ under $\|\cdot\|_s$. This phenomenon is usually known
as the "ill-posed inverse problem" in the literature.

The general form of (1.1) was first studied by Ai and Chen (2003) and
Newey and Powell (2003), where the authors considered sieve approximation
of $g_0$ and estimated it in a compact parameter space. Recently, Chen and
Pouzo (2009a) relaxed the compactness assumption and achieved the
consistency and convergence rate using the penalized sieve minimum
distance estimation. In recent years there has also been extensive
literature on the NPIV model (Example 1.3) itself. In these papers, the
authors introduce a Tikhonov tuning parameter to play the role of
"regularization" in order to overcome the ill-posed inverse problem; see,
for example, Hall and Horowitz (2005) and Darolles et al. (2011). Other
related works on nonparametric instrumental variables can be found in
Chernozhukov, Gagliardini and Scaillet (2008), Johannes, Van Bellegem and
Vanhems (2010), Horowitz (2007, 2011), among others.

Compared to the growing literature from the frequentist perspective,
there is very little understanding of the consistent estimation using
either a Bayesian or a quasi-Bayesian approach. This paper proposes
a quasi-Bayesian procedure and studies the impact of various priors of
$g_0$ on the posterior consistency. Our setup is built on a sieve
approximation technique similar to Chen and Pouzo (2009a), which assumes
that $g_0$ can be approximated arbitrarily well on a finite-dimensional
sieve space. In order to keep our procedure robust to the distribution
specification and convenient for practical implementation, without
specifying a known distribution on the data generating process, we employ
a limited information likelihood [Kim (2002) and Liao and Jiang (2010)],
a moment-condition-based Gaussian approximated likelihood. The use of
such a likelihood is more straightforward for models characterized by
either moment conditions or estimating equations than the common methods
based on Dirichlet process priors in the nonparametric Bayesian
literature. With priors placed directly on the sieve coefficients, we
show that the proposed posterior is consistent. Due to the difficulty of
identifying $g_0$ in practice, we do not assume $g_0$ to be necessarily
identified. As a result the posterior consistency here means that,
asymptotically, the posterior converges into an arbitrarily small
neighborhood of the region where $g_0$ is partially identified.
Therefore, we also extend model (1.1) to the partial identification setup
[Chernozhukov, Hong and Tamer (2007) and Santos (2012)]. We will consider
three types of priors: (i) priors supported on a bounded set (truncated
prior), (ii) priors with tails decaying fast outside a bounded set
(thin-tail prior) and (iii) Gaussian priors with nonshrinking variance.

Recently, Florens and Simoni (2009a) proposed a quasi-Bayesian approach
for the NPIV model. They assumed that the error term follows a normal
distribution and achieved consistency by regularizing an operator that
defines the posterior mean. Our approach differs from theirs essentially
in the way of overcoming the ill-posed inverse problem. While Florens and
Simoni (2009a) put a Gaussian prior on an infinite-dimensional function
space, they require the variance of the prior to shrink to zero. In
contrast, we place the prior directly on the sieve coefficients in
a finite-dimensional vector space and require the sieve dimension to grow
slowly with the sample size. Our approach then corresponds to Chen and
Pouzo's (2009a) sieve minimum distance procedure using slowly growing
sieves. As a result, it is the finite-dimensional sieve that plays the
role of regularization instead of a shrinking prior. In addition, our
approach allows nonnormal priors.

Models based on moment conditions such as (1.1) have proved essential in
many statistical applications, such as financial asset pricing [Gallant
and Tauchen (1989), Chen and Ludvigson (2009)], consumer behavior in
economics [Blundell, Chen and Kristensen (2007), Santos (2012)] and
return to college education [Horowitz (2011)]. This paper therefore
develops a convenient and straightforward quasi-Bayesian approach for
these applied problems.

The remainder of this paper is organized as follows: Section 2 introduces 
general theorems on two types of posterior consistency, which provide suf- 
ficient conditions under which a posterior constructed on a sieve space is 
consistent. Section 3 specifies the priors and shows the consistency results 
by verifying the sufficient conditions given in Section 2. Section 4 studies in 
detail the NPIV model as a specific example. Section 5 discusses the case 
of the random sieve dimension. Finally, Section 6 concludes with further 
discussions. Proofs are given in the supplementary material. 

Throughout the paper, for any two positive deterministic sequences
$\{a_n\}_{n=1}^{\infty}$ and $\{b_n\}_{n=1}^{\infty}$, write
$a_n \succ b_n$ and $b_n \prec a_n$ if $b_n = o(a_n)$. In addition,
$a_n \sim b_n$ if there exist $c_1$ and $c_2 > 0$ such that
$c_1 b_n \le a_n \le c_2 b_n$ for all large enough $n$.

2. General posterior consistency theorems. 

2.1. Sieve approximation. Suppose we are interested in a nonparametric
regression function $g_0 \in (\mathcal{H}, \|\cdot\|_s)$, which is assumed
to be inside an infinite-dimensional Banach space $\mathcal{H}$ endowed
with norm $\|\cdot\|_s$. Examples of the space
$(\mathcal{H}, \|\cdot\|_s)$ include: the space of bounded continuous
functions with norm $\|g\|_s = \sup_x |g(x)|$, the space of square
integrable functions $\{g : E[g(X)^2] < \infty\}$ with
$\|g\|_s = \sqrt{E[g(X)^2]}$, etc. In addition, suppose there exists a set
of basis functions $\{\phi_1, \phi_2, \ldots\} \subset \mathcal{H}$ such
that $g_0 \in \mathcal{H}$ can be approximated by a truncated sum
$g_b = \sum_{i=1}^{q_n} b_i \phi_i$ for a vector of coefficients
$(b_1, \ldots, b_{q_n})^T$, where $q_n$ is a pre-determined constant that
grows to infinity. Then $g_b$ lies in an approximating space
$\mathcal{H}_n$ spanned by $\{\phi_1, \ldots, \phi_{q_n}\}$. Here
$\mathcal{H}_n$ grows to be dense in $\mathcal{H}$, and is called a sieve
approximating space.
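
To make the truncated sum concrete, the following sketch builds $g_b$
from a coefficient vector. The cosine basis on $[0, 1]$ is an assumption
chosen here for illustration (the paper does not fix a basis at this
point), and the function names are hypothetical.

```python
import math


def cosine_basis(i, x):
    """i-th cosine basis function on [0, 1]; an illustrative choice only.

    phi_1(x) = 1 and phi_i(x) = sqrt(2) * cos((i - 1) * pi * x) for i >= 2,
    which are orthonormal with respect to the uniform distribution on [0, 1].
    """
    if i == 1:
        return 1.0
    return math.sqrt(2.0) * math.cos((i - 1) * math.pi * x)


def sieve_function(b):
    """Return g_b(x) = sum_{i=1}^{q} b_i * phi_i(x) for b = (b_1, ..., b_q)."""
    def g_b(x):
        return sum(b_i * cosine_basis(i, x) for i, b_i in enumerate(b, start=1))
    return g_b
```

For instance, `sieve_function([1.0, 0.5])` is the element of the
two-dimensional sieve space with coefficients $b_1 = 1$ and $b_2 = 0.5$;
enlarging the coefficient vector corresponds to letting $q_n$ grow.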

There is extensive literature on the posterior consistency using sieve ap- 
proximation. Shen and Wasserman (2001) applied an orthogonal basis ex- 
pansion to the nonparametric regression problem. Walker (2003) and Choi 
and Schervish (2007) provided general results for a class of Bayesian re- 
gression models when the data have a normal distribution. Other results 
on nonparametric regression problems can be found, for example, in Huang 
(2004), Ghosal and van der Vaart (2007), etc. 

Suppose we are given $n$ independent identically distributed observations
$X^n = (X_1, X_2, \ldots, X_n)$. In this paper we do not assume any
specific distribution of $X^n \mid g_0$, but propose a quasi-Bayesian
approach, which is based on a pseudo-likelihood,

$L(g_b) = \exp(-\frac{n}{2} \hat{G}(g_b)),$

where $\hat{G} : \mathcal{H}_n \to [0, \infty)$ is a stochastic
functional, which we call the sample risk functional. Suppose there exists
a nonnegative functional $G$, such that for a bounded set
$\mathcal{F}_n \subset \mathcal{H}_n$,

$\sup_{g_b \in \mathcal{F}_n} |\hat{G}(g_b) - G(g_b)| = o_p(1).$

We call $G$ the objective functional or risk functional throughout the
paper.

In the literature, it is often assumed that the true regression function
$g_0$ is point identified (as opposed to "partially identified" in the
following) as the unique minimizer of $G$ on $\mathcal{H}$, that is,

$\{g_0\} = \arg\min_{g \in \mathcal{H}} G(g).$

Then quasi-Bayesian approaches usually construct $\hat{G}$ as the sample
analog of $G$ [see Chernozhukov and Hong (2003)]. In many applications of
the model considered in this paper, however, it is more natural to assume
that $G$ has multiple global minimizers on $\mathcal{H}$; see detailed
discussions in Section 3. In this case, we say $g_0$ is partially
identified (in the frequentist sense) on

$\Theta_I = \arg\min_{g \in \mathcal{H}} G(g),$

and $\Theta_I$ is called the identified region. Therefore $\Theta_I$ is
the main object of interest in this paper.

For any $b = (b_1, \ldots, b_{q_n})^T \in \mathbb{R}^{q_n}$, let
$g_b = \sum_{i=1}^{q_n} b_i \phi_i$. Similarly to the standard treatments
in Smith and Kohn (1996) and Antoniadis, Gregoire and McKeague (2004), we
put prior $\pi(b)$ on the sieve coefficients
$b = (b_1, b_2, \ldots, b_{q_n})$, and obtain a posterior distribution,

$P(g_b \mid X^n) \propto \pi(b) L(g_b).$

For any $g_1 \in \mathcal{H}$, define

$d(g_1, \Theta_I) = \inf_{g \in \Theta_I} \|g_1 - g\|_s,$

and the $\varepsilon$-expansion as a neighborhood of the identified region

$\Theta_I^{\varepsilon} = \{g \in \mathcal{H} : d(g, \Theta_I) < \varepsilon\}.$

Then the posterior consistency in this paper refers to the following: for
any $\varepsilon > 0$,

$P(g_b \in \Theta_I^{\varepsilon} \mid X^n) \to^p 1.$
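
The quasi-posterior can be sketched numerically as follows. The sketch
assumes the pseudo-likelihood has the form $\exp(-\frac{n}{2}\hat{G}(g_b))$,
so that the unnormalized log posterior is $\log \pi(b) - \frac{n}{2}\hat{G}(g_b)$;
the function names and the finite grid used to normalize the posterior
are illustrative assumptions, not the authors' computational method.

```python
import math


def log_quasi_posterior(b, log_prior, sample_risk, n):
    """Unnormalised log posterior: log pi(b) - (n / 2) * G_hat(g_b)."""
    return log_prior(b) - 0.5 * n * sample_risk(b)


def grid_posterior_mass(grid, in_region, log_prior, sample_risk, n):
    """Posterior mass of a region when b is restricted to a finite grid."""
    logs = [log_quasi_posterior(b, log_prior, sample_risk, n) for b in grid]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logs]
    inside = sum(w for w, b in zip(weights, grid) if in_region(b))
    return inside / sum(weights)
```

With a toy risk such as `lambda b: (b[0] - 1.0) ** 2` and a flat prior,
the mass assigned to a fixed neighborhood of the minimizer increases with
`n`, which is the behavior that posterior consistency formalizes.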

2.2. Posterior consistency theorems. We first present two theorems of
general posterior consistency using the sieve approximation, which involve
conditions on the tail probability of $\pi$ as well as the performance of
$\hat{G}$. They are based on the following variant of an inequality from
Jiang and Tanner (2008), Proposition 6. These inequalities will be proved
in the supplementary material [Liao and Jiang (2011a)]:

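Lemma 2.1. Suppose the support of the prior $\pi$ can be partitioned as
$\mathcal{F}_n \cup \mathcal{F}_n^c$. Then for any deterministic sequence
$\delta_n > 0$,

(2.1)
$$
\begin{aligned}
E\Big[P\Big(G(g_b) - \inf_{g \in \mathcal{H}} G(g) > 5\delta_n \,\Big|\, X^n\Big)\Big]
&\le P\Big(\sup_{g \in \mathcal{F}_n} |\hat{G}(g) - G(g)| > \delta_n\Big) \\
&\quad + \frac{e^{-2n\delta_n}}{\pi(G(g_b) - \inf_{g \in \mathcal{H}} G(g) < \delta_n \cap g_b \in \mathcal{F}_n)} \\
&\quad + E P(g_b \in \mathcal{F}_n^c \mid X^n).
\end{aligned}
$$

In addition,

$$
E P(g_b \in \mathcal{F}_n^c \mid X^n)
\le P\Big(\sup_{g \in \mathcal{F}_n} |\hat{G}(g) - G(g)| > \delta_n\Big)
+ \frac{\pi(\mathcal{F}_n^c)\, e^{2n\delta_n}}{\pi(G(g_b) - \inf_{g \in \mathcal{H}} G(g) < \delta_n \cap g_b \in \mathcal{F}_n)}.
$$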

These inequalities imply the following result on the risk consistency:

Theorem 2.1 (Risk consistency). Suppose the following conditions hold
with respect to a deterministic positive sequence $\delta_n$:

(i) Tail condition: as $q_n$ and $n \to \infty$, either
$E P(g_b \in \mathcal{F}_n^c \mid X^n) = o(1)$ or
$\pi(\mathcal{F}_n^c) = o(e^{-4n\delta_n})$.

(ii) Approximation condition:
$\pi(G(g_b) - \inf_{g \in \mathcal{H}} G(g) < \delta_n, g_b \in \mathcal{F}_n) \succ e^{-2n\delta_n}$.

(iii) Uniform convergence:
$P[\sup_{g \in \mathcal{F}_n} |\hat{G}(g) - G(g)| > \delta_n] = o(1)$.

Then we have the risk consistency result at rate $\delta_n$:

$P(G(g_b) - \inf_{g \in \mathcal{H}} G(g) < \delta_n \mid X^n) = 1 - o_p(1).$

The naming of these conditions is obvious, except for (ii). There, the
approximation refers to the ability of the functions in $\mathcal{F}_n$
(proposed by the prior $\pi$) to approximately minimize the risk $G$ over
$\mathcal{H}$ with not-too-small prior probability.

When the following condition is added, the risk consistency leads to the 
estimation consistency. 

Theorem 2.2 (Estimation consistency). Suppose there exists a sequence
$\delta_n$ such that the following conditions hold:

(i), (ii), (iii) in the previous theorem;

(iv) (distinguishing ability) for any $\varepsilon > 0$,

$\inf_{g \in \mathcal{H}_n, g \notin \Theta_I^{\varepsilon}} G(g) - \inf_{g \in \mathcal{H}} G(g) \succ \delta_n.$

Then for any $\varepsilon > 0$, we have

(2.2)  $P(g_b \in \Theta_I^{\varepsilon} \mid X^n) \to^p 1.$

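Proof. Theorem 2.1 is implied by Lemma 2.1. Now we prove Theorem 2.2.
For any $\varepsilon > 0$, by Theorem 2.1,

$$
\begin{aligned}
P(g_b \notin \Theta_I^{\varepsilon} \mid X^n)
&\le P\Big(g_b \notin \Theta_I^{\varepsilon},\ G(g_b) - \inf_{g \in \mathcal{H}} G(g) < \delta_n \,\Big|\, X^n\Big) + o_p(1) \\
&\le P\Big(g_b \notin \Theta_I^{\varepsilon},\ G(g_b) \ge \inf_{g \in \mathcal{H}_n, g \notin \Theta_I^{\varepsilon}} G(g),\ G(g_b) - \inf_{g \in \mathcal{H}} G(g) < \delta_n \,\Big|\, X^n\Big) + o_p(1) \\
&\le P\Big(g_b \notin \Theta_I^{\varepsilon},\ \delta_n < G(g_b) - \inf_{g \in \mathcal{H}} G(g) < \delta_n \,\Big|\, X^n\Big) + o_p(1) \\
&= o_p(1),
\end{aligned}
$$

where the third inequality is implied by condition (iv) for all large $n$. □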

As a special case of these results, note that when $g_0$ is point
identified as the unique minimizer of $G(g)$ on $\mathcal{H}$, that is,
$\Theta_I = \{g_0\}$, (2.2) then becomes

$P(\|g_b - g_0\|_s < \varepsilon \mid X^n) \to^p 1,$

the regular posterior consistency result.

In the subsequent sections, we will construct a so-called limited
information likelihood $\hat{G}(g)$ and apply the previous two theorems to
the conditional moment restricted model (1.1), by verifying conditions
(i)-(iv).

3. Conditional moment-restricted model. 

3.1. Limited information likelihood. Consider a conditional moment
condition

(3.1)  $E[\rho(Z, g_0) \mid W, g_0] = 0,$

where $g_0 \in \mathcal{H}$ is the true nonparametric structural function.
Here $W$ is $d$-dimensional, with fixed $d$. For simplicity, throughout
the paper, let us assume $W$ is supported on $[0, 1]^d$, as one can always
apply the transformation on each component of $W$,
$W_i \to \Phi(W_i)$, where $\Phi(\cdot)$ is the standard normal cumulative
distribution function. We focus on the case when $\rho$ is
a one-dimensional function.

Following the setting of Ai and Chen (2003) and Chen and Pouzo (2009a),
we approximate $\mathcal{H}$ by a sieve space $\mathcal{H}_n$ that grows
to be dense in $\mathcal{H}$. Here $\mathcal{H}_n$ is a finite-dimensional
space spanned by sieve basis functions $\{\phi_1, \ldots, \phi_{q_n}\}$
such as splines, power series, wavelets and Fourier series.

As the first step, we transform the conditional moment restriction into
unconditional moment restrictions (but still conditional on $g_0$). Let
$\{[(i - 1)/k_n, i/k_n]\}_{i=1}^{k_n}$ be a partition of $[0, 1]$, for
some $k_n \in \mathbb{N}$. We then obtain a partition of the support of
$W$: $[0, 1]^d = \bigcup_{j=1}^{k_n^d} R_j^n$, where for each
$j = 1, \ldots, k_n^d$,

(3.2)  $R_j^n = \prod_{l=1}^{d} [\frac{i_l - 1}{k_n}, \frac{i_l}{k_n}]$  for some $i_l \in \{1, \ldots, k_n\}$.

We require $k_n \to \infty$ as $n \to \infty$. Let $X = (Z, W)$. For each
$j$, define

$m_{nj}(g, X) = \rho(Z, g) I(W \in R_j^n),$

where $I(\cdot)$ is the indicator function. Let
$m_n(g, X) = (m_{n1}(g, X), \ldots, m_{n k_n^d}(g, X))^T$, which is
a $k_n^d \times 1$ vector. Equation (3.1) then implies

(3.3)  $E m_n(g_0, X) = 0,$

where the expectation is taken with respect to the joint distribution of
$X = (Z, W)$ conditional on $g_0$. Throughout the paper, the expectation
is always taken conditionally on $g_0$. When $k_n > q_n$ there are more
moment conditions than parameters, and hence (3.3) is a problem of many
moment conditions with an increasing number of moments, as studied by Han
and Phillips (2006).
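
The construction of $m_n(g, X)$ can be sketched as follows. The 0-based
cell numbering, the half-open tie-breaking at cell boundaries and the
function names are assumptions made for illustration; the paper only
requires that the cubes $R_j^n$ partition $[0, 1]^d$.

```python
def cell_index(w, k):
    """Map w in [0, 1]^d to the 0-based index of the cube containing it.

    Each coordinate is assigned to one of k equal-width intervals; a
    coordinate equal to 1 is placed in the last interval.
    """
    j = 0
    for w_l in w:
        i_l = min(int(w_l * k), k - 1)
        j = j * k + i_l
    return j


def moment_vector(rho_value, w, k):
    """m_n(g, X): rho(Z, g) in the slot of the cube containing W, 0 elsewhere."""
    d = len(w)
    m = [0.0] * (k ** d)
    m[cell_index(w, k)] = rho_value
    return m
```

For one observation only a single entry of the $k^d$-vector is nonzero,
so averaging these vectors over the sample amounts to summing residuals
within each cube and dividing by $n$.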

It is straightforward to verify that

$V_0 = \mathrm{Var}(m_n(g_0, X)) = \mathrm{diag}\{E(\rho(Z, g_0)^2 I(W \in R_1^n)), \ldots, E(\rho(Z, g_0)^2 I(W \in R_{k_n^d}^n))\}.$

For each $g \in \mathcal{H}$, and $j = 1, \ldots, k_n^d$, write
$\bar{m}_{nj}(g) = \frac{1}{n} \sum_{i=1}^{n} m_{nj}(g, X_i)$ and
$\bar{m}_n(g) = (\bar{m}_{n1}(g), \ldots, \bar{m}_{n k_n^d}(g))^T$.
Instead of $g_0$, we construct the posterior for its approximating
function inside $\mathcal{H}_n$. Under some regularity conditions, for
each fixed $k$, $\bar{m}_n(g_0)$ would satisfy the central limit theorem:
for any $a \in \mathbb{R}^{k^d}$, as $n$ goes to infinity,

(3.4)  $P(\sqrt{n}\, V_0^{-1/2} \bar{m}_n(g_0) \le a) - \int_{x \le a} (2\pi)^{-k^d/2} e^{-\|x\|^2/2}\, dx \to 0.$

This motivates a likelihood function on the sieve space $\mathcal{H}_n$,

$\mathrm{LIL}(g_b) \propto \exp(-\frac{n}{2} \bar{m}_n(g_b)^T V_0^{-1} \bar{m}_n(g_b)).$

According to Kim (2002), the function $\mathrm{LIL}(g_b)$ can be more
appropriately interpreted as the best approximation to the true likelihood
function under the conditional moment restriction by minimizing the
Kullback-Leibler divergence, which is known as the limited information
likelihood (LIL). Note that $\mathrm{LIL}(g_b)$ is not feasible, as $V_0$
depends on the unknown function $g_0$; therefore Kim (2002) suggested
replacing $V_0$ with a constant matrix (not dependent on $g_0$), while
maintaining the order of each element. For each element on the diagonal,
suppose we have the integration mean value theorem: for some
$w^* \in R_j^n$,

$E(\rho(Z, g_0)^2 I(W \in R_j^n)) = E(\rho(Z, g_0)^2 \mid W = w^*) P(W \in R_j^n) = O(P(W \in R_j^n))$

provided that $\sup_{w \in [0,1]^d} E[\rho(Z, g_0)^2 \mid W = w] < \infty$.
Hence each diagonal element of $V_0$ is of the same order as
$P(W \in R_j^n)$. We replace $V_0$ by

$\hat{V} = \mathrm{diag}\{\hat{v}_1, \ldots, \hat{v}_{k_n^d}\}$  where  $\hat{v}_j = \frac{1}{n} \sum_{i=1}^{n} I(W_i \in R_j^n).$

Each $\hat{v}_j$ is a consistent estimate of $P(W \in R_j^n)$. We thus
obtain the feasible LIL to be used as the likelihood function throughout
this paper,

(3.5)  $L(g_b) = \exp\{-\frac{n}{2} \bar{m}_n(g_b)^T \hat{V}^{-1} \bar{m}_n(g_b)\}.$
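
A minimal numerical sketch of the feasible LIL is given below. It
assumes the Gaussian form $\exp\{-\frac{n}{2}\bar{m}_n^T \hat{V}^{-1} \bar{m}_n\}$
with diagonal $\hat{V}$ built from cell frequencies, and it assumes that
residuals and cell indices have already been computed; the function
names and the handling of empty cells are illustrative choices.

```python
def sample_risk(rho_values, cells, num_cells):
    """G_hat = m_bar^T V_hat^{-1} m_bar with V_hat = diag(v_hat_1, ...).

    rho_values[i] is rho(Z_i, g_b); cells[i] is the index of the cube
    containing W_i. Empty cubes (v_hat_j = 0) are skipped, since
    m_bar_j is also 0 there.
    """
    n = len(rho_values)
    m_bar = [0.0] * num_cells
    v_hat = [0.0] * num_cells
    for r, j in zip(rho_values, cells):
        m_bar[j] += r / n      # average of rho * I(W in R_j)
        v_hat[j] += 1.0 / n    # fraction of observations in R_j
    return sum(m * m / v for m, v in zip(m_bar, v_hat) if v > 0.0)


def log_feasible_lil(rho_values, cells, num_cells):
    """log L(g_b) = -(n / 2) * G_hat(g_b)."""
    n = len(rho_values)
    return -0.5 * n * sample_risk(rho_values, cells, num_cells)
```

Because $\hat{V}$ is diagonal, no matrix inversion is needed: the
quadratic form reduces to a sum of squared cell averages divided by cell
frequencies.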

The feasible likelihood puts more weight on the moment conditions with
smaller variance, in the same spirit as the optimal weight matrix in the
generalized method of moments [Hansen (1982)]. A more refined approach
can be based on a second-stage estimation of $V_0$, where a consistent
first-stage estimator of $g_0$ is used if $g_0$ is assumed to be point
identified. However, it turns out that $V_0$ does not have to be estimated
very precisely in order to achieve the posterior consistency for the
inference on $g$. We will show that our simple estimator $\hat{V}$ is
already good enough for proving posterior consistency in the development
to be described below and is simple for practical computations.

For the approximated Gaussian likelihood function (3.5), the sample risk
functional defined in Section 2 is given by

(3.6)  $\hat{G}(g_b) = \bar{m}_n(g_b)^T \hat{V}^{-1} \bar{m}_n(g_b).$

Let

$\mathcal{F}_n = \{\sum_{i=1}^{q_n} b_i \phi_i(x) : \max_{1 \le i \le q_n} |b_i| \le B_n\}$

for some sequence $B_n \to \infty$; then we partition the sieve space into
$\mathcal{H}_n = \mathcal{F}_n \cup \mathcal{F}_n^c$. Under some
regularity conditions, it can be shown that¹ $\hat{G}$ converges in
probability to the risk functional

(3.7)  $G(g) = E_W\{[E(\rho(Z, g) \mid W)]^2\} = \int [E(\rho(Z, g) \mid W = w)]^2\, dF_W(w)$

uniformly on $\mathcal{F}_n$.

3.2. Identification and ill-posedness. The identification of $g_0$ is
characterized by minimizing $G$. To be specific, define the identified
region for $g_0$,

$\Theta_I = \{g \in \mathcal{H} : E(\rho(Z, g) \mid W = w) = 0$ for almost all $w \in [0, 1]^d\},$

which is assumed to be nonempty; then

$\Theta_I = \arg\min_{g \in \mathcal{H}} G(g) = \{g \in \mathcal{H} : G(g) = 0\}.$

If $\Theta_I$ is a singleton, then $\Theta_I = \{g_0\}$. Otherwise $g_0$
is partially identified on $\Theta_I$; see, for example, Santos (2012).

In the conditional moment restriction literature, the problem of
identification and estimation of $g_0$ is well known to be ill posed. The
ill-posed problem was discussed in detail by Kress [(1999), Chapter 15];
it occurs, in our context, if one of the following three properties does
not hold: (1) there exist solutions to $G(g) = 0$, and here we assume
$g_0 \in \Theta_I$; (2) the solution is unique, that is, $\Theta_I$ is
a singleton; (3) the solution is continuously dependent on the data; that
is, roughly speaking, when $G(g)$ is close to zero, $g$ should be close to
$\Theta_I$. However, when $g_0$ depends on the endogenous variable $X$,
the third property may fail because for any $\varepsilon > 0$, there are
sequences $\{g_n\}_{n=1}^{\infty} \subset \mathcal{H}$ such that

$\liminf_{n \to \infty} \inf_{d(g_n, \Theta_I) \ge \varepsilon} G(g_n) = 0.$

Throughout this paper, we call such a problem the type-III ill-posed
inverse problem. In order to achieve the posterior consistency, we need
a certain regularization scheme to make the metric $d(g, \Theta_I)$
continuous with respect to the risk functional $G(g)$.



¹ We will verify this for the nonparametric IV regression model in Section 4.

While the literature puts a primary interest on dealing with the type-III
ill-posedness [Hall and Horowitz (2005), etc.], there are relatively fewer
results that deal with the second type of ill-posedness: $\Theta_I$ is not
necessarily a singleton. In this paper, we also allow $g_0$ to be only
partially identified² by the conditional moment restriction (3.1). Such
a treatment arises for two reasons. First, when the conditional moment
restriction is given by the nonparametric instrumental variable regression
(Example 1.3), the identification of $g_0$ depends on the completeness of
the conditional distribution of $X \mid W$ [Newey and Powell (2003)];
however, the completeness assumption is hard to verify if the conditional
distribution of $X \mid W$ does not belong to the exponential family.
Severini and Tripathi (2006) explored identification issues with these
models and noted that the point identification can easily fail; see
Example 3.2 of Severini and Tripathi (2006). Second, sometimes instead of
$g_0$ itself, we are only interested in a particular characteristic of it,
for example, its linear functional $h(g_0)$. For example, in the
nonparametric IV regression, if $g_0(x)$ represents the inverse demand
function, then its consumer surplus at some level $x^*$ can be written as
a functional $h(g_0) = \int_0^{x^*} g_0(x)\, dx - g_0(x^*) x^*$. In this
case, the identification of $g_0$ might not be necessary; as Severini and
Tripathi (2006) showed, even if $g_0$ is not identified, it is still
possible to point identify its functional $h(g_0)$.

3.3. Prior specification. We will apply Theorems 2.1 and 2.2 to three
types of priors: (i) truncated prior, (ii) thin-tail prior and (iii)
normal prior. In this section we will focus on the first two types of
priors, with which more generally consistent results can be derived.³

Truncated prior. The prior is supported only on $\mathcal{F}_n$. In
particular, we consider the uniform and truncated normal priors,
respectively,

uniform prior:  $\pi(b) = \prod_{i=1}^{q_n} \frac{I(|b_i| \le B_n)}{2 B_n},$

truncated normal:  $\pi(b) = \prod_{i=1}^{q_n} \frac{f(b_i) I(|b_i| \le B_n)}{P(|Z_i| \le B_n)},$

where $\{Z_i\}_{i=1}^{q_n}$ are i.i.d. random variables from
$N(0, \sigma^2)$ for some $\sigma^2 > 0$, and $f(\cdot)$ is the
probability density function of $Z_i$. The tail probability

$\pi(g_b \in \mathcal{F}_n^c) = 0.$
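
The two truncated priors can be evaluated on the log scale as in the
sketch below. It assumes the product forms above with each coefficient
restricted to $[-B_n, B_n]$; the function names are hypothetical, and the
normalizing constant of the truncated normal uses the identity
$P(|Z| \le B) = \mathrm{erf}(B / (\sigma\sqrt{2}))$ for $Z \sim N(0, \sigma^2)$.

```python
import math


def log_uniform_prior(b, B):
    """Log density of the uniform prior on the box max_i |b_i| <= B."""
    if any(abs(b_i) > B for b_i in b):
        return -math.inf
    return -len(b) * math.log(2.0 * B)


def log_truncated_normal_prior(b, B, sigma):
    """Log density of independent N(0, sigma^2) coefficients truncated to [-B, B]."""
    if any(abs(b_i) > B for b_i in b):
        return -math.inf
    # P(|Z| <= B) for Z ~ N(0, sigma^2)
    log_mass = math.log(math.erf(B / (sigma * math.sqrt(2.0))))
    log_density = sum(
        -0.5 * (b_i / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
        for b_i in b
    )
    return log_density - len(b) * log_mass
```

Both functions return `-inf` outside the box, which encodes the zero
prior mass on $\mathcal{F}_n^c$.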



² In this paper, the partial identification is meant in the frequentist
sense, as opposed to the Bayesian identification. See a recent work by
Florens and Simoni (2011) for a discussion of these concepts.

³ We will describe the normal prior in a later section (Section 4.4) since
the technique used is somewhat different, which handles mainly the
situation of the NPIV model in an identifiable situation.

Thin-tail prior. The prior $\pi$ on $b \in \mathbb{R}^{q_n}$ is defined
such that the density is symmetric in all directions, and $\|b\|^r$
follows an exponential distribution with mean $\beta^{-r}$ (for some
$\beta > 0$, $r > 0$). Here $\|b\|$ denotes the Euclidean norm,

$\pi(\|b\|^r > u^r) = e^{-\beta^r u^r},$

which, together with the spherical symmetry, is enough to derive the
density function,

(3.8)  $\pi(b) = \frac{r \beta^r \|b\|^{r - q_n} \exp(-\beta^r \|b\|^r)}{S_{q_n}},$

where $S_{q_n}$ is the area of the $(q_n - 1)$-dimensional unit sphere in
Euclidean norm. For this prior, the parameter $1/\beta$ is roughly the
radius of most of the prior mass, and $r$ denotes the thinness of the
tails outside. The bigger $r$ is, the thinner the tail.
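
A draw from a prior with these two properties can be sketched as below:
a direction that is uniform on the unit sphere, and a radius whose $r$-th
power is exponential with mean $\beta^{-r}$. The sampling recipe and the
function names are illustrative assumptions rather than a procedure
taken from the paper.

```python
import math
import random


def sample_thin_tail(q, beta, r, rng):
    """Draw b in R^q, spherically symmetric, with ||b||^r ~ Exp(mean beta^{-r})."""
    # Normalised Gaussian vector: uniform direction on the unit sphere.
    z = [rng.gauss(0.0, 1.0) for _ in range(q)]
    norm = math.sqrt(sum(v * v for v in z))
    # Exp(1) / beta^r has mean beta^{-r}; take the r-th root for the radius.
    radius = (rng.expovariate(1.0) / beta ** r) ** (1.0 / r)
    return [radius * v / norm for v in z]


def tail_probability(u, beta, r):
    """Exact prior tail: pi(||b|| > u) = exp(-(beta * u)^r)."""
    return math.exp(-((beta * u) ** r))
```

Comparing the empirical frequency of $\|b\| > u$ over many draws with
`tail_probability(u, beta, r)` gives a quick check that the radial law
matches the stated tail formula.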

This prior is very similar to the class of distributions defined in Azzalini 
(1986). Both allow any positive power of the distance to the origin to be 
placed on the exponent. Our density is slightly different and does not, in 
general, include the normal density exactly. However, it is derived in a way 
so that the tail probability has an exact expression. Hence it is convenient 
to impose a regularity condition on the tail probability. 
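
The exact tail expression also yields a simple bound on the prior mass
outside the coefficient box. The short derivation below is a sketch that
assumes $\mathcal{F}_n$ is the set of sieve functions whose coefficients
satisfy $\max_i |b_i| \le B_n$, as in Section 3.1. Since
$\max_i |b_i| \le \|b\|$, the Euclidean ball of radius $B_n$ lies inside
that box.

```latex
\begin{aligned}
\{b : \|b\| \le B_n\} &\subset \{b : \max_{1 \le i \le q_n} |b_i| \le B_n\}, \\
\pi(g_b \in \mathcal{F}_n^c) &\le \pi(\|b\| > B_n)
  = \pi(\|b\|^r > B_n^r) = e^{-\beta^r B_n^r}.
\end{aligned}
```

Under this assumption the prior tail outside $\mathcal{F}_n$ decays
exponentially in $B_n^r$.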

Florens and Simoni (2009a, 2009b) placed a Gaussian prior whose variance
decreases to zero with the sample size. Our priors specified here are
similar to theirs in the sense that the prior tail probability is small:
when the truncated prior is used, $\pi(g_b \in \mathcal{F}_n^c) = 0$; when
the thin-tail prior is used, $\pi(g_b \in \mathcal{F}_n^c)$ decreases
exponentially fast in $n$. Both types of priors ensure that

$P(G(g_b) > \delta_n \mid X^n) = o_p(1)$

for some decaying sequence $\delta_n > 0$ that depends on the convergence
rate of $\sup_{\mathcal{F}_n} |\hat{G}(g) - G(g)|$. The technique of using
a prior that decays exponentially fast outside a bounded sieve set is
commonly used in the nonparametric posterior consistency literature; see,
for example, Ghosh and Ramamoorthi (2003), Ghosal and Roy (2006), Choi and
Schervish (2007), Walker (2003) and many references therein.

However, there is an important difference between Florens and Simoni's
prior settings (2009a) and our own. While Florens and Simoni (2009a) put
their prior on an infinite-dimensional function space, they require the
variance of the Gaussian prior to shrink to zero as a regularization
scheme in order to achieve the posterior consistency. In contrast, our
prior is placed directly on the sieve coefficients
$(b_1, \ldots, b_{q_n})$ in a finite-dimensional vector space, and neither
the truncated prior nor the thin-tail prior shrinks to a point mass. When
$q_n$ grows slowly with $n$, it can be shown that⁴ for any
$\varepsilon > 0$,

$\inf_{g_b \in \mathcal{H}_n, d(g_b, \Theta_I) \ge \varepsilon} G(g_b) \succ \delta_n;$

hence the distinguishing ability condition in Theorem 2.2 is satisfied. As
a result, in our procedure it is the fact that $q_n$ grows slowly that
plays the role of regularization instead of a shrinking prior. Later in
Section 4.4, we will also verify that with a suitably chosen $q_n$,
a nonshrinking normal prior can be used to achieve the posterior
consistency in the identified NPIV model.

3.4. Posterior consistency. The following assumptions are imposed. 

Assumption 3.1. The data $X^n = (X_1, \ldots, X_n)$ are independent and
identically distributed.

Assumption 3.2. There exists a positive sequence $\lambda_n \to 0$ such
that

$\sup_{g \in \mathcal{F}_n} |\hat{G}(g) - G(g)| = O_p(\lambda_n).$

Since $\mathcal{F}_n$ is compact in $\mathcal{H}_n$, as long as the radius
of $\mathcal{F}_n$ grows slowly, the uniform convergence condition in
Assumption 3.2 can be shown using similar techniques to those in Han and
Phillips (2006). We will verify it for the nonparametric IV regression
example in Section 4.

Assumption 3.3. (i) $\{\phi_1, \phi_2, \ldots, \phi_{q_n}\}$ forms an
orthonormal basis of $\mathcal{H}_n$ such that
$E(\phi_i(X) \phi_j(X)) = \delta_{ij}$, the Kronecker $\delta$.

(ii) There exist $g_0 \in \Theta_I$, and
$g_n^* = \sum_{i=1}^{q_n} b_i^* \phi_i \in \mathcal{H}_n$ such that
$\|g_n^* - g_0\|_s = o(1)$ as $q_n \to \infty$.

The existence of $g_n^*$ is simply implied by the definition of a sieve
space. It is satisfied by the spaces that are spanned by commonly used
sieve basis functions such as splines, power series, wavelets and Fourier
series. For example, if the parameter space is a Sobolev space
$W_2^p[0, 1]^{d_x}$, where $d_x = \dim(X)$, and $\|\cdot\|_s$ is the
Sobolev norm, then $\|g_n^* - g_0\|_s = O(q_n^{-p/d_x})$ for some $p > 0$;
see, for example, Kress [(1999), Chapter 8] and Chen (2007); see also
Schumaker (1981) and Meyer (1990) for splines and orthogonal wavelets in
other function spaces.

Assumption 3.4. There exists $C > 0$ such that
$\forall g_1, g_2 \in \mathcal{H}$,

$E|\rho(Z, g_1) - \rho(Z, g_2)| \le C E|g_1(X) - g_2(X)|.$



⁴ We will verify this for the nonparametric IV regression model.

This assumption is trivially satisfied by the nonparametric IV regression
in Example 1.3. Here we give another example that satisfies this
assumption.

Example 3.1 (Nonparametric quantile IV regression). Consider the model in
Example 1.4, in which the conditional moment restriction is given by

$E(\rho(Z, g_0) \mid W, g_0) = 0, \qquad \rho(Z, g_0) = I(y < g_0(X)) - \gamma.$

It is straightforward to verify that for any $g_1, g_2$,

$$
\begin{aligned}
E|\rho(Z, g_1) - \rho(Z, g_2)|
&= E|I(g_1(X) < y < g_2(X)) + I(g_2(X) < y < g_1(X))| \\
&= E[P(g_1(X) < y < g_2(X) \mid X)] + E[P(g_2(X) < y < g_1(X) \mid X)].
\end{aligned}
$$

Suppose there exists a constant $C > 0$ such that $F_{y|x}(\cdot)$, the
conditional c.d.f. of $y \mid X$, satisfies

$|F_{y|x}(y_1) - F_{y|x}(y_2)| \le C |y_1 - y_2|$

for any $y_1, y_2 \in \mathbb{R}$ and $x$ in the support of $X$. Then the
first term on the right-hand side is bounded by

$E[P(g_1(X) < y < g_2(X) \mid X)] \le E|F_{y|X}(g_2(X)) - F_{y|X}(g_1(X))| \le C E|g_2(X) - g_1(X)|.$

Likewise, $E[P(g_2(X) < y < g_1(X) \mid X)] \le C E|g_2(X) - g_1(X)|$.
Therefore Assumption 3.4 is satisfied.

Define

(3.9)  $\gamma_n = \sup_{g \in \mathcal{F}_n, w \in [0,1]^d} |E(\rho(Z, g) \mid W = w)| + 1.$

We are able to verify the conditions in Theorem 2.1 with the previous
assumptions, and establish the following theorem:

Theorem 3.1 (Risk consistency: truncated prior). Suppose $q_n = o(n)$
and $B_n = o(n)$. Assume $\delta_n = O(1)$ is such that there exists
$g_0 \in \Theta_I$ whose sieve approximation $g_n^*$ satisfies

$\max\{G(g_n^*), \lambda_n, \frac{q_n}{n} \log(\gamma_n n)\} = o(\delta_n).$

Then when either the uniform prior or the truncated normal prior is used,
under Assumptions 3.1-3.4,

$P(G(g_b) < \delta_n \mid X^n) \to^p 1.$

In the following theorem, write $\lambda(B_n) = \lambda_n$ and
$\gamma(B_n) = \gamma_n$ to indicate the dependence of $\lambda_n$ and
$\gamma_n$ on $B_n$, defined in Assumption 3.2 and (3.9), respectively.

Theorem 3.2 (Risk consistency: thin-tail prior). Suppose there exists
$g_0 \in \Theta_I$ with $g_n^*$ being its sieve approximation in
$\mathcal{H}_n$, and a sequence $B_n^* \to \infty$ such that
$\max\{G(g_n^*), \lambda(B_n^*), \gamma(B_n^*) e^{-n\lambda(B_n^*)/q_n}\} = o((B_n^*)^r / n)$.
In addition, suppose $\delta_n = O(1)$ is such that

$\max\{G(g_n^*), \lambda(B_n^*), \gamma(B_n^*) e^{-n\lambda(B_n^*)/q_n}\} = o(\delta_n).$

Then under Assumptions 3.1-3.4,

$P(G(g_b) < \delta_n \mid X^n) \to^p 1.$

Remark 3.1. (1) We will show in the next section that in the nonparametric IV regression model, $\gamma_n = O(q_n B_n)$. For the nonparametric quantile IV regression in Example 3.1, $\gamma_n$ is a constant that is bounded away from zero.

(2) Under the conditions of Theorems 3.1 and 3.2, $\delta_n$ can be fixed as a constant. Namely, $\forall \delta > 0$,

$$P(G(g_b) > \delta \mid X^n) = o_p(1).$$

Roughly speaking, the posterior distribution is asymptotically supported on the set where $G$ is minimized. This result has many important applications. For example, in the binary treatment effect study, let $Y \in \{0, 1\}$ indicate whether a treatment is successful, which is associated with a covariate $X$. Suppose we model the success probability $P(Y = 1\mid X = x)$ by a nonparametric function $g(x)$. In this model,

$$G(g) = E_X\{[E(Y\mid X) - g(X)]^2\} = \|P(Y = 1\mid X) - g(X)\|_s^2,$$

where $\|g\|_s^2 = E(g(X)^2)$. By Theorems 3.1 and 3.2, for any $\epsilon > 0$, the posterior satisfies

$$P(\|P(Y = 1\mid X) - g_b(X)\|_s^2 < \epsilon \mid \text{Data}) \to^p 1,$$

which implies that the posterior of $g_b$ can recover the success probability arbitrarily well with high probability.

(3) In data mining, this type of result is sometimes called "risk consistency." For example, if $G$ were the classification risk, the risk consistency result would show that the posterior effectively minimizes the misclassification error. The current definition of $G$, however, is not the classification risk. In nonparametric regression and in the NPIV example, the risk $G$ becomes, respectively, $E_W\{[E(Y\mid W) - g(W)]^2\}$ and $E_W\{[E(Y\mid W) - E(g(X)\mid W)]^2\}$, which measures how much of $E(Y\mid W)$ would be missed if it were estimated by (something derived from) $g$.

The following two theorems establish the posterior consistency without assuming the compactness of the parameter space $\mathcal{H}$.

Theorem 3.3 (Posterior consistency: truncated prior). Suppose there exists $g_0 \in \Theta_I$ whose sieve approximation $g^*_{q_n}$ satisfies, $\forall \epsilon > 0$,

$$\max\Big\{G(g^*_{q_n}),\ \lambda_n,\ \frac{q_n}{n}\log(\gamma_n n)\Big\} = o\Big(\inf_{g\in\mathcal{H}_n,\ g\notin\Theta_I^{\epsilon}} G(g)\Big). \tag{3.10}$$

Then under Assumptions 3.1-3.4, for any $\epsilon > 0$,

$$P(d(g_b,\Theta_I) < \epsilon \mid X^n) \to^p 1.$$

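Theorem 3.4 (Posterior consistency: thin-tail prior). Suppose there exists $g_0 \in \Theta_I$ with $g^*_{q_n}$ being its sieve approximation in $\mathcal{H}_n$, and a sequence $B_n^* \to \infty$ such that

$$\max\{G(g^*_{q_n}),\ \lambda(B_n^*),\ \gamma(B_n^*)e^{-n\lambda(B_n^*)/q_n}\} = o(B_n^{*r}/n).$$

In addition, suppose $\forall \epsilon > 0$,

$$\max\{G(g^*_{q_n}),\ \lambda(B_n^*),\ \gamma(B_n^*)e^{-n\lambda(B_n^*)/q_n}\} = o\Big(\inf_{g\in\mathcal{H}_n,\ g\notin\Theta_I^{\epsilon}} G(g)\Big). \tag{3.11}$$

Then under Assumptions 3.1-3.4, for any $\epsilon > 0$,

$$P(d(g_b,\Theta_I) < \epsilon \mid X^n) \to^p 1.$$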

Remark 3.2. (1) The restriction $\lambda(B_n^*) = o(B_n^{*r}/n)$ in both Theorems 3.2 and 3.4 requires that $r$, the thin-tail prior parameter, should not be too small; otherwise, no such $B_n^*$ exists. In the NPIV model, which will be illustrated in the next section, we need $r > 6d + 4$, where $d = \dim(W)$.

(2) Conditions (3.10) and (3.11) are similar to Chen and Pouzo's [(2009a), condition (3.1)], where they require that $q_n$ grow slowly enough so that $\inf_{g\in\mathcal{H}_n,\ g\notin\Theta_I^{\epsilon}} G(g)$ does not decrease too fast for any fixed $\epsilon > 0$. This will also be illustrated in Section 4.

Let $h(g_0)$ be a linear functional of $g_0$ that may be of direct practical interest. For example, if $h(g_0) = E[g_0(X)\omega(X)]$ for some weight function $\omega$, then with proper choices of $\omega$, $h$ can be used to test special properties of $g_0$, such as monotonicity or convexity [Santos (2011)]. On the other hand, $h$ itself may have interesting meanings. For example, when $g_0$ denotes the inverse demand function in nonparametric regression, $h(g_0)$ can be the consumer surplus [Santos (2012)]. Severini and Tripathi (2006) have provided conditions to point identify $h(g_0)$ even if $g_0$ itself is not identified.

Example 3.2. Suppose we want to test whether the unknown function $g_0$ is weakly increasing. Note that any weakly increasing function $g(x)$ must satisfy $\int \sin(x)g(x)\,dx \ge 0$; hence the functional of interest here is $h(g_0) = \int \sin(x)g_0(x)\,dx$. Suppose the joint distribution of $(X, W)$ has density function $f_{XW}(x,w)$. By Severini and Tripathi (2006), $h(g_0)$ is point identified if there exists $p(w)$ such that $E[p(W)^2] < \infty$ and $E(p(W)\mid X) = \sin(X)/f_X(X)$ almost surely.

Theorems 3.3 and 3.4 imply a flexible way to consistently estimate $h$ without identifying $g_0$. In the following assumption, condition (i) assumes the point identification of $h(g_0)$. Condition (ii) requires the uniform continuity of $h$, which is satisfied when $h(g) = E[g(X)\omega(X)]$ if $\sup_x |\omega(x)| < \infty$ and $E|g_1 - g_2| \le C\|g_1(X) - g_2(X)\|_s$ for any $g_1, g_2 \in \mathcal{H}$.

Assumption 3.5. (i) $\{h(g) : g \in \Theta_I\} = \{h(g_0)\}$; (ii) $h : (\mathcal{H}, \|\cdot\|_s) \to \mathbb{R}$ is uniformly continuous.

Corollary 3.1. Suppose the assumptions of Theorem 3.3 (if the truncated priors are used) and Theorem 3.4 (if the thin-tail prior is used) are satisfied. In addition, suppose Assumption 3.5 holds. When $g_0$ is not necessarily point identified, $\forall \delta > 0$,

$$P(|h(g_b) - h(g_0)| < \delta \mid X^n) \to^p 1.$$

4. Nonparametric instrumental variable regression.

4.1. The model. The nonparametric instrumental variable regression (NPIV) model is given by

$$Y = g_0(X) + \epsilon,$$

where $X$ is endogenous, that is, correlated with $\epsilon$. We consider the following parameter space and norm $\|\cdot\|_s$:

$$\mathcal{H} = L^2(X) = \{g : E[g(X)^2] < \infty\}, \qquad \|g\|_s^2 = E[g(X)^2].$$

In addition, suppose we observe an instrumental variable $W \in [0,1]^d$ such that $E(\epsilon\mid W) = 0$. Applications of instrumental variables can be found in many standard econometrics texts, for example, Hansen (2002). Let $Z = (Y, X)$; the NPIV model is then essentially a conditional moment restricted model with $\rho(Z,g) = Y - g(X)$.

Let $\{\phi_1, \phi_2, \ldots\}$ be a set of orthonormal basis functions of $L^2(X)$. We consider the sieve space $\mathcal{H}_n = \{g \in L^2(X) : g = \sum_{i=1}^{q_n} b_i\phi_i\}$, which can be partitioned into $\mathcal{H}_n = \mathcal{F}_n \cup \mathcal{F}_n^c$, where $\mathcal{F}_n = \{\sum_{i=1}^{q_n} b_i\phi_i \in \mathcal{H}_n,\ \max_{i\le q_n}|b_i| \le B_n\}$ as in Section 3.

We apply the feasible LIL (3.5) to construct the posterior. The log- 
likelihood involves the sample risk functional 

Kin \ 2 

j=i \ i=i j 

which later will be shown to uniformly converge to 

$$G(g) = E_W\{[E(Y - g(X)\mid W)]^2\}$$

over $\mathcal{F}_n$. The identified region $\Theta_I$ is defined as the subset of $L^2(X)$ on which $G(g) = 0$.
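
To make the population risk $G(g)$ concrete, the following sketch computes a simple binned plug-in estimate of $E_W\{[E(Y - g(X)\mid W)]^2\}$ for a scalar instrument on $[0,1]$. It is only an illustration under stated assumptions: the equal-width cells, the function name `binned_sample_risk`, and the normalization (cell frequency times squared cell mean of the residual) are hypothetical choices and need not coincide with the exact form of the sample risk functional used in the feasible LIL (3.5).

```python
def binned_sample_risk(y, x, w, g, k):
    """Plug-in estimate of E_W{[E(Y - g(X) | W)]^2} using k equal cells of [0, 1].

    Assumption: the instrument w is scalar and takes values in [0, 1].
    """
    n = len(y)
    cell_sum = [0.0] * k      # sum of residuals Y - g(X) within each cell
    cell_count = [0] * k      # number of observations within each cell
    for yi, xi, wi in zip(y, x, w):
        j = min(int(wi * k), k - 1)   # index of the cell containing wi
        cell_sum[j] += yi - g(xi)
        cell_count[j] += 1
    risk = 0.0
    for s, c in zip(cell_sum, cell_count):
        if c > 0:
            # (c / n) * (s / c)^2 = s^2 / (c * n)
            risk += s * s / (c * n)
    return risk


# Tiny deterministic check with hypothetical data in which Y = 2 X exactly.
x = [i / 10 for i in range(10)]
w = list(x)
y = [2.0 * xi for xi in x]
risk_at_truth = binned_sample_risk(y, x, w, lambda t: 2.0 * t, k=5)
risk_shifted = binned_sample_risk(y, x, w, lambda t: 2.0 * t - 1.0, k=5)
print(risk_at_truth, risk_shifted)
```

A function that satisfies the moment restriction exactly gives zero risk, while shifting it by a constant gives a risk equal to the squared shift, which matches the interpretation of $\Theta_I$ as the zero set of $G$.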









4.2. Risk consistency. Under mild conditions, we can derive the convergence rate of $\sup_{g\in\mathcal{F}_n}|\hat G(g) - G(g)|$. The following assumptions are imposed.

Assumption 4.1. (i) $k_n^{-d} = O(\min_{j\le k_n^d} P(W \in R_j^n))$;
(ii) $\max_{j\le k_n^d} P(W \in R_j^n) = O(k_n^{-d})$.

This assumption is satisfied, for example, when $W$ has a continuous density function on $[0,1]^d$ that is bounded away from both zero and infinity.

Assumption 4.2. There exists $C > 0$ such that for all $i = 1, \ldots, q_n$:

(i) $\sup_w E(Y^2\mid W = w) < C$, $\sup_w E(\phi_i(X)^2\mid W = w) < C$;

(ii) $E(Y\mid W = w)$ is Lipschitz continuous with respect to $w$ on $[0,1]^d$;

(iii) for any $w_1, w_2 \in [0,1]^d$,

$$|E(\phi_i(X)\mid W = w_1) - E(\phi_i(X)\mid W = w_2)| \le C\|w_1 - w_2\|.$$

Condition (iii) requires that the family $\{E(\phi_i(X)\mid W = w) : i \le q_n\}$ is Lipschitz equicontinuous on $[0,1]^d$, which is satisfied, for example, when $X$ has a density function that is bounded away from zero on the support of $X$ and, in addition, $X\mid W$ has a conditional density function $f_{X|W}$ such that for some $C > 0$,

$$|f_{X|W}(x\mid w_1) - f_{X|W}(x\mid w_2)| \le C\|w_1 - w_2\|$$

for all $x$ and $w_1, w_2 \in [0,1]^d$.$^{5}$

Assumption 4.3. There exist $g_0 \in \Theta_I$, $g^*_{q_n} = \sum_{i=1}^{q_n} b_i\phi_i$ with $\sum_{i=1}^{\infty} b_i^2 < \infty$, and a positive sequence $\{\eta_j\}_{j=1}^{\infty}$ that strictly decreases to zero as $j \to \infty$ such that $\|g^*_{q_n} - g_0\|_s = O(\eta_{q_n})$ as $q_n \to \infty$. (We will choose $g^*_{q_n}$ to be the projection of $g_0$ onto $\mathcal{H}_n$, unless otherwise noted.)

Examples of the rate $\eta_{q_n}$ are discussed earlier, after Assumption 3.3.

Theorem 4.1. Assume $q_n^2B_n^2 = o(\min\{\sqrt{n}/k_n^{3d/2},\ k_n\})$. Then under Assumptions 3.1, 4.1, 4.2,

$$\sup_{g\in\mathcal{F}_n}|\hat G(g) - G(g)| = O_p\Big(q_n^2B_n^2\Big(\frac{k_n^{3d/2}}{\sqrt{n}} + \frac{1}{k_n}\Big)\Big).$$



5 This is simple to show: for any $w_1, w_2$, $|E(\phi_i(X)\mid W = w_1) - E(\phi_i(X)\mid W = w_2)| \le (\inf_x f_X(x))^{-1}\int |\phi_i(x)| f_X(x)\,|f_{X|W}(x\mid w_1) - f_{X|W}(x\mid w_2)|\,dx \le C\|w_1 - w_2\|\,E|\phi_i(X)| \le C'\|w_1 - w_2\|$, where the fact that $E|\phi_i(X)|$ is bounded away from infinity is guaranteed by condition (i).

Define a semi-norm $\|\cdot\|_w$, which is weaker than $\|\cdot\|_s$, as

$$\|g\|_w^2 = E_W\{(E(g(X)\mid W))^2\}. \tag{4.1}$$

It can be easily verified that $\|\cdot\|_w$ satisfies the triangle inequality, but $\|g\|_w = 0$ does not necessarily imply $g = 0$ if the conditional distribution is not complete. Note that $G(g) = \|g_0 - g\|_w^2$; hence this semi-norm induces an equivalence class characterized by the identified region $\Theta_I = \{g \in L^2(X) : E(Y - g(X)\mid W) = 0 \text{ a.s.}\}$, such that $\|g - g_0\|_w = 0$ if and only if $g \in \Theta_I$. In other words, we can say that $g_0$ is weakly identified under $\|\cdot\|_w$, since for any $g \in \Theta_I$, $g$ and $g_0$ are equivalent under $\|\cdot\|_w$.
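
A toy discrete example may help to see why $\|\cdot\|_w$ is only a semi-norm. In the sketch below, $X$ is uniform on four points and the instrument only reveals the parity of $X$; this design and all names are assumptions chosen for illustration, not part of the paper. Exact rational arithmetic is used so that the zero is exact.

```python
from fractions import Fraction

support_x = [0, 1, 2, 3]            # assumption: X is uniform on these points


def instrument(x):
    """Assumption: W = X mod 2, so W reveals only the parity of X."""
    return x % 2


def strong_norm_sq(g):
    """||g||_s^2 = E[g(X)^2]."""
    return sum(Fraction(g(x)) ** 2 for x in support_x) / len(support_x)


def weak_norm_sq(g):
    """||g||_w^2 = E_W{(E[g(X) | W])^2}."""
    total = Fraction(0)
    for w in (0, 1):
        cell = [x for x in support_x if instrument(x) == w]
        cond_mean = sum(Fraction(g(x)) for x in cell) / len(cell)
        total += Fraction(len(cell), len(support_x)) * cond_mean ** 2
    return total


def g(x):
    """A nonzero function whose conditional mean given W vanishes."""
    return {0: 1, 1: 0, 2: -1, 3: 0}[x]


print(strong_norm_sq(g), weak_norm_sq(g))
```

Here $g \ne 0$ under $\|\cdot\|_s$, yet $\|g\|_w = 0$, so adding $g$ to any element of the identified region yields another element of it; this is the partial identification situation described above.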

The following theorem is a straightforward application of Theorems 3.1 and 3.2:

Theorem 4.2 (Risk consistency). Under Assumptions 3.1, 4.1-4.3, suppose $\delta_n = O(1)$ is such that:

(i) for the truncated priors, assuming $q_n^2B_n^2 = o(n^{1/(3d+2)})$,

$$\max\Big\{\eta_{q_n}^2,\ q_n^2B_n^2\Big(\frac{k_n^{3d/2}}{\sqrt{n}} + \frac{1}{k_n}\Big)\Big\} = o(\delta_n);$$

(ii) for the thin-tail prior with $r > 6d + 4$, assuming $q_n = o(n^{1/(6d+4) - 1/r})$,

$$\max\Big\{\eta_{q_n}^2,\ n^{2/(r-2)}q_n^{2r/(r-2)}\Big(\frac{k_n^{3d/2}}{\sqrt{n}} + \frac{1}{k_n}\Big)^{r/(r-2)}\Big\} = o(\delta_n);$$

then

$$P(\|g_b - g_0\|_w > \delta_n \mid X^n) = o_p(1).$$

4.3. Ill-posedness and posterior consistency. Define

$$T : L^2(X) \to \{\zeta : E[\zeta(W)^2] < \infty\}, \qquad T(g) = E(g(X)\mid W),$$

and write $E(Y\mid W = w) = \zeta(w)$. Then the NPIV model can be equivalently written as

$$Tg = \zeta. \tag{4.2}$$

Under Assumption 4.4, $T$ is a compact linear operator [see Carrasco, Florens and Renault (2007)], and therefore is continuous. Equation (4.2) is usually called a Fredholm integral equation of the first kind.

Assumption 4.4. The joint distribution of $(Y, X, W)$ is absolutely continuous with respect to the Lebesgue measure. In addition, let $f_{XW}(x,w)$, $f_X(x)$ and $f_W(w)$ denote the density functions of $(X,W)$, $X$ and $W$, respectively; then

$$\int\!\!\int \Big(\frac{f_{XW}(x,w)}{f_X(x)f_W(w)}\Big)^2 f_X(x)f_W(w)\,dx\,dw < \infty.$$

As described before, the problem of inference about $g_0$ is ill-posed in two aspects. The first ill-posedness comes from the identification, which depends on the invertibility of $T$. If $T$ is nonsingular, in which case its null space is $\{0\}$, $g_0$ can be point identified by $g_0 = T^{-1}\zeta$, but not otherwise. See Severini and Tripathi (2006) and D'Haultfoeuille (2011) for detailed descriptions of the identification issues.

Even when $g_0$ is identified, in which case $T^{-1}$ exists, as pointed out by Florens (2003) and Hall and Horowitz (2005), since $L^2(X)$ is of infinite dimension and $T$ is compact, $T^{-1}$ is not bounded (and therefore is not continuous). As a result, a small inaccuracy in the estimation of $\zeta$ can lead to a large inaccuracy in the estimation of $g_0$, which is known as the type-III ill-posed inverse problem described in Section 3.2. When $g_0$ is partially identified, this problem is still present when

$$\liminf_{n\to\infty}\ \inf_{g\in\mathcal{H}_n,\ g\notin\Theta_I^{\epsilon}} G(g) = \liminf_{n\to\infty}\ \inf_{g\in\mathcal{H}_n,\ g\notin\Theta_I^{\epsilon}} E\{[T(g - g_0)]^2\} = 0.$$
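
The unboundedness of $T^{-1}$ can be illustrated in a diagonal toy model. Suppose, as an assumption made only for this sketch, that $T$ acts on the $j$th singular direction by multiplication with a singular value $\nu_j > 0$. An error of size $e$ in $\zeta$ along that direction then becomes an error of size $e/\nu_j$ in $g$, so restricting attention to the first $q$ directions caps the amplification at $1/\nu_q$. The decay rates and names below are hypothetical.

```python
import math


def inverse_amplification(singular_values, q):
    """Norm of the inverse of a diagonal operator restricted to the first q directions."""
    return 1.0 / min(singular_values[:q])


# Hypothetical singular values: polynomial versus exponential decay.
mild = [j ** -2.0 for j in range(1, 21)]
severe = [math.exp(-j) for j in range(1, 21)]

for q in (2, 5, 10):
    print(q, inverse_amplification(mild, q), inverse_amplification(severe, q))
```

The amplification factor grows without bound as $q$ increases, which is why a slowly growing sieve dimension can act as the regularization device; the growth is much faster when the singular values decay exponentially.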

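By Theorems 3.3, 3.4 and 4.2, in order to achieve the posterior consistency, it suffices to verify

$$\delta_n^* = o\Big(\inf_{g\in\mathcal{H}_n,\ g\notin\Theta_I^{\epsilon}} G(g)\Big), \tag{4.3}$$

where, for the truncated prior,

$$\delta_n^* = \max\Big\{\eta_{q_n}^2,\ q_n^2B_n^2\Big(\frac{k_n^{3d/2}}{\sqrt{n}} + \frac{1}{k_n}\Big)\Big\},$$

and, for the thin-tail prior,

$$\delta_n^* = \max\Big\{\eta_{q_n}^2,\ n^{2/(r-2)}q_n^{2r/(r-2)}\Big(\frac{k_n^{3d/2}}{\sqrt{n}} + \frac{1}{k_n}\Big)^{r/(r-2)}\Big\}.$$

Hence we first need to derive a lower bound for $\inf_{g\in\mathcal{H}_n,\ g\notin\Theta_I^{\epsilon}} G(g)$, and, in addition, this lower bound should decay at a rate slower than $\delta_n^*$.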

When $g_0$ is point identified and a slowly growing finite-dimensional sieve is used, Chen and Pouzo (2009a) showed the existence of such a lower bound using the singular value decomposition of $T$. Their approach is briefly illustrated in the following example.

Example 4.1. Let $\langle g_1, g_2\rangle_X = E[g_1(X)g_2(X)]$ denote the inner product of two elements in $L^2(X)$, and $\{\nu_j, \phi_{1j}, \phi_{2j}\}_{j=1}^{\infty}$ be the ordered singular value system of $T$ such that

$$Tg = \sum_{j=1}^{\infty}\nu_j\langle g, \phi_{1j}\rangle_X\,\phi_{2j}.$$

Suppose $T$ is nonsingular; then $\{\phi_{1j}\}_{j=1}^{\infty}$ forms an orthonormal basis of $L^2(X)$. Chen and Pouzo (2009a) showed that when $\{\phi_{1j}\}_{j=1}^{q_n}$ is used as the basis in the sieve approximation space, $\forall \epsilon > 0$, $\nu_{q_n}^2 = O(\inf_{g\in\mathcal{H}_n,\ g\notin\Theta_I^{\epsilon}} G(g))$. Therefore, condition (4.3) is satisfied if we assume $\delta_n^* = o(\nu_{q_n}^2)$. In addition, suppose $\{\nu_j\}_{j=1}^{\infty}$ decays at a polynomial rate $j^{-a}$ for some $a > 0$; then we require $q_n = o((\delta_n^*)^{-1/(2a)})$, a slowly growing sieve dimension.

We impose the following assumption to derive a lower bound for $\inf_{g\in\mathcal{H}_n,\ g\notin\Theta_I^{\epsilon}} G(g)$ and verify (4.3), which, in the identified case, uses more general basis functions for the sieve space. Therefore we allow the sieve basis to be different from the eigenfunctions of $T$. A similar approach was used by Chen and Reiss [(2011), Section 6.1], who used wavelets as the sieve basis functions while the eigenfunctions of $T$ form a Fourier basis.

Assumption 4.5. There is a continuous and increasing function $\varphi(\cdot) > 0$ satisfying $\lim_{t\to 0^+}\varphi(t) = 0$ such that, for $\{g_0, g^*_{q_n}, \{\eta_j\}_{j=1}^{\infty}\}$ as defined in Assumption 4.3 and some constants $C_1, C_2 > 0$:

(i) $\|g - g_0\|_w^2 \ge C_1\sum_{j=1}^{\infty}\varphi(\eta_j^2)|\langle g - g_0, \phi_j\rangle_X|^2$ for all $g \in L^2(X)$;

(ii) $\|g^*_{q_n} - g_0\|_w^2 \le C_2\sum_{j}\varphi(\eta_j^2)|\langle g_0 - g^*_{q_n}, \phi_j\rangle_X|^2$.

Remark 4.1. (1) This assumption implies a generalization of the relation $\nu_{q_n}^2 = O(\inf_{g\in\mathcal{H}_n,\ g\notin\Theta_I^{\epsilon}} G(g))$ in Example 4.1. In this assumption, $\{\phi_j\}_{j=1}^{\infty}$ are the basis functions whose first $q_n$ terms span the sieve approximation space. In the identified case, $\{\phi_j\}_{j=1}^{\infty}$ can be a general set of basis functions that is different from the eigenfunctions of $T$. Chen and Pouzo [(2009a), Section 5.3] identified the singular values of Example 4.1 as a special case of the general $\varphi(\eta_j^2)$, in which case Assumption 4.5 is satisfied. In its general form, Assumption 4.5 is standard in the literature on linear ill-posed inverse problems when the convergence rate of the estimator is studied; see, for example, Nair, Pereverzev and Tautenhahn (2005), Chen and Pouzo [(2009a), Assumption 5.2] and Chen and Reiss [(2011), Section 2.1]. As described above, however, this assumption is also needed in order to verify (4.3) and show consistency when general basis functions are used. Blundell, Chen and Kristensen (2007) provided sufficient conditions for Assumption 4.5 in the NPIV model setting.

(2) In the partially identified case when $\Theta_I$ is not a singleton, Assumption 4.5 is still satisfied if we take $\{\phi_j\}_{j=1}^{\infty}$ to be the eigenfunctions of $T^*T$ that correspond to its nonzero eigenvalues, where $T$ is the conditional expectation operator and $T^*$ is its adjoint. The spectral theory of compact operators [Kress (1999)] implies that $\|T(g - g_0)\|_s^2 = \sum_{j=1}^{\infty}\nu_j^2|\langle g - g_0, \phi_j\rangle_X|^2$ for all $g \in L^2(X)$, where $\{\nu_j^2\}$ represent all the (nonzero) eigenvalues of $T^*T$, and $\{\phi_j\}$ are the corresponding eigenfunctions (the zero eigenvalues of $T^*T$ do not contribute to the right-hand side of the spectral decomposition). Therefore, Assumption 4.5 remains valid with $\varphi(\eta_j^2) = \nu_j^2$, with $\{\nu_j^2\}$ denoting the sequence of decreasing nonzero eigenvalues. This idea of using the spectral representation of $T^*T$ is related to the commonly used "general source condition" in the literature [Tautenhahn (1998) and Darolles et al. (2011)], where, for example, Darolles et al. (2011) used this condition to derive the convergence rate of their kernel-based Tikhonov regularized estimator in NPIV regression.

(3) When a more general sieve basis $\{\phi_j\}_{j=1}^{\infty}$ is used in the partially identified case, condition (i) of Assumption 4.5 is not generally satisfied. For example, suppose there exists $g \in \Theta_I$, but $g \ne g_0$. By the definition of $\|\cdot\|_w$, $\|g - g_0\|_w = 0$; but the right-hand side of the displayed inequality in condition (i) is strictly positive unless $\{\phi_j\}_{j=1}^{\infty}$ are the eigenfunctions of $T^*T$. To allow for a more general sieve basis in this case, a possible approach is to assume that the true $g_0$ in the data generating process lies in a compact set $\Theta$, for example, a Sobolev ball [Chen and Reiss (2011)]. It is then not hard to show that $\inf_{g\in\Theta,\ g\notin\Theta_I^{\epsilon}} G(g)$ is bounded away from zero. Restricting $g_0$ to a compact set is a quite common approach in nonparametric IV regression; see Newey and Powell (2003), Blundell, Chen and Kristensen (2007) and Chen and Reiss (2011). Recently, Santos (2012) extended this approach to the partially identified case, with the compactness restriction. We do not pursue this approach here, since our other results on posterior consistency allow a noncompact parameter space.

As in Chen and Pouzo (2009a), the degree of ill-posedness is generally of two types:

(1) mild ill-posedness: $\varphi(\eta) = \eta^a$ for some $a > 0$.

(2) severe ill-posedness: $\varphi(\eta) = \exp(-\eta^{-a})$ for some $a > 0$.

Under Assumption 4.5, it can be shown that $\varphi(\eta_{q_n}^2) = O(\inf_{g\in\mathcal{H}_n,\ g\notin\Theta_I^{\epsilon}} G(g))$ for any $\epsilon > 0$; see Lemma C.5 of the supplementary material. Intuitively speaking, $\varphi(\cdot)$ is associated with the singular values of $T$ and is related to how severe the type-III ill-posed inverse problem is. When the nonzero singular values decay at a polynomial rate, $\varphi$ corresponds to the mildly ill-posed case; when the singular values decay at an exponential rate, it corresponds to the severely ill-posed case.
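
To get a feel for how differently the two regimes behave, the sketch below evaluates $\varphi(\eta_q^2)$, the order of the lower bound on $\inf G(g)$, for a few sieve dimensions. The polynomial approximation rate $\eta_q = q^{-v}$ and the particular values of $a$ and $v$ are assumptions made for this illustration.

```python
import math


def phi_mild(eta, a):
    """Mildly ill-posed case: phi(eta) = eta^a."""
    return eta ** a


def phi_severe(eta, a):
    """Severely ill-posed case: phi(eta) = exp(-eta^(-a))."""
    return math.exp(-eta ** (-a))


def lower_bound_rate(q, v, a, severe=False):
    """phi(eta_q^2) under the assumed sieve rate eta_q = q^(-v)."""
    eta_sq = q ** (-2.0 * v)
    return phi_severe(eta_sq, a) if severe else phi_mild(eta_sq, a)


for q in (2, 5, 10):
    print(q, lower_bound_rate(q, 1.0, 1.0), lower_bound_rate(q, 1.0, 1.0, severe=True))
```

In the mild case the bound shrinks polynomially in $q$ (here as $q^{-2av}$), whereas in the severe case it shrinks like $\exp(-q^{2av})$, so the sieve dimension has to grow far more slowly for a condition such as (4.3) to remain feasible.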

Before formally presenting our posterior consistency result, we briefly comment on the role of condition (ii) of Assumption 4.5. It corresponds to Assumption 5.2(ii), the so-called "stability condition," in Chen and Pouzo (2009a), and is required to hold only in terms of the sieve approximation error on one element in $\Theta_I$. By Theorems 3.3 and 3.4, we require $G(g^*_{q_n}) = o(\inf_{g\in\mathcal{H}_n,\ g\notin\Theta_I^{\epsilon}} G(g))$. It can be easily shown that $G(g^*_{q_n}) = O(\eta_{q_n}^2)$, and hence $G(g^*_{q_n})$ was replaced with $\eta_{q_n}^2$ in the condition of Theorem 4.2. In addition, condition (i) of Assumption 4.5 implies that $\varphi(\eta_{q_n}^2) = O(\inf_{g\in\mathcal{H}_n,\ g\notin\Theta_I^{\epsilon}} G(g))$. With condition (ii) of Assumption 4.5, it can be further shown that $G(g^*_{q_n}) = O(\eta_{q_n}^2\varphi(\eta_{q_n}^2))$ (see Lemma C.6 in the supplementary material). Since $\eta_{q_n}^2 = o(1)$, $G(g^*_{q_n}) = o(\varphi(\eta_{q_n}^2)) = o(\inf_{g\in\mathcal{H}_n,\ g\notin\Theta_I^{\epsilon}} G(g))$ is verified.

Under this framework, we have the posterior consistency under $\|\cdot\|_s$:

Theorem 4.3 (Posterior consistency). Under Assumptions 3.1, 4.1-4.5, suppose:

(i) for the truncated priors, assuming $q_n^2B_n^2 = o(n^{1/(3d+2)})$,

$$q_n^2B_n^2\Big(\frac{k_n^{3d/2}}{\sqrt{n}} + \frac{1}{k_n}\Big) = o(\varphi(\eta_{q_n}^2)); \tag{4.4}$$

(ii) for the thin-tail prior with $r > 6d + 4$, assuming $q_n = o(n^{1/(6d+4) - 1/r})$,

$$n^{2/(r-2)}q_n^{2r/(r-2)}\Big(\frac{k_n^{3d/2}}{\sqrt{n}} + \frac{1}{k_n}\Big)^{r/(r-2)} = o(\varphi(\eta_{q_n}^2)). \tag{4.5}$$

Then for any $\epsilon > 0$,

$$P(d(g_b,\Theta_I) > \epsilon \mid X^n) = o_p(1).$$

4.4. Normal prior. When $g_0$ is point identified, we can also establish the posterior consistency using normal priors

$$\pi(b) = \prod_{i=1}^{q_n}\pi(b_i), \qquad \pi(b_i) \sim N(0, \sigma^2), \tag{4.6}$$

for some constant $\sigma^2 > 0$. As discussed previously, by restricting $q_n$ to grow slowly as $n \to \infty$, we do not need a shrinking prior to function as a penalty term attached to the log-likelihood for the regularization purpose.$^{6}$ Therefore $\sigma^2$ is treated as a fixed constant that does not depend on $n$.

With the assumptions imposed in Sections 4.2 and 4.3, we can verify all 
the conditions in Theorem 2.2, which then leads to the following theorem: 

Theorem 4.4 (Posterior consistency using Gaussian prior). Assume $g_0$ is point identified. Under Assumptions 3.1, 4.1-4.5, suppose the normal prior (4.6) is used, and

/k 3d/2 xl/3 

(4.7) H^T + S) = ° {V(VL)) ' 
then for any $\epsilon > 0$,

$$P(\|g_b - g_0\|_s > \epsilon \mid X^n) = o_p(1).$$

6 We thank a referee for pointing this out.

4.5. Choice of tuning parameters. To choose $(k_n, q_n, B_n)$ that satisfy (4.4), (4.5) and (4.7) for each specified prior, consider the case where $\eta_{q_n}$ decreases as some power of $q_n$ [see, e.g., Schumaker (1981) and Meyer (1990)], and $k_n$ grows at a polynomial rate of $n$, that is,

$$\eta_{q_n} \sim q_n^{-v} \ \text{ for some } v > 0, \qquad \frac{k_n^{3d/2}}{\sqrt{n}} + \frac{1}{k_n} \sim n^{-\rho}, \quad 0 < \rho \le \frac{1}{3d+2}. \tag{4.8}$$

We then have the following corollaries: 

Corollary 4.1 (Truncated prior). Suppose the truncated prior (either uniform or truncated normal) is used; then the following choice of $(q_n, B_n)$ achieves the posterior consistency, for $b < \rho$:

(i) in the mildly ill-posed case,

$$B_n^2 \sim n^b, \qquad q_n = o(n^{(\rho - b)/(2 + 2av)});$$

(ii) in the severely ill-posed case,

$$B_n^2 \sim n^b, \qquad q_n = o((\log n)^{1/(2av)}).$$

Corollary 4.2 (Thin-tail prior). Suppose the thin-tail prior is used; then the following choice of $q_n$ achieves the posterior consistency, for $\rho r > 2$:

(i) in the mildly ill-posed case,

$$q_n = o(n^{(\rho r - 2)/(2r + 2av(r-2))});$$

(ii) in the severely ill-posed case,

$$q_n = o((\log n)^{1/(2av)}).$$

Corollary 4.3 (Normal prior). Suppose the normal prior is used and $g_0$ is point identified; then the following choice of $q_n$ achieves the posterior consistency:

(i) in the mildly ill-posed case, 

q n = o(n^ 1+2av »); 

(ii) in the severely ill-posed case,

$$q_n = o((\log n)^{1/(2av)}).$$

In the conditions of these consistency results, the choice of the tuning parameters $(q_n, B_n, r)$ depends on some parameters that one either knows or chooses $(d, \rho)$, as well as some parameters related to the true model $(a, v)$. The latter, although undesirable, cannot be totally avoided when we study the frequentist convergence properties under ill-posedness. [Conditions depending on the true model are also used, e.g., by Chen and Pouzo (2009a), directly in their Corollary 5.1, and indirectly at the end of their Section 3.1.]

On the other hand, these results can still have meaningful implications that do not explicitly depend on the indexes $a$ and $v$ (which are probably unknown in practice). For example, we note that in the mildly ill-posed situations, the condition on $q_n$ would be satisfied if it grows as any finite power of $\log n$. Likewise, in the severely ill-posed situations, the condition on $q_n$ would be satisfied if it grows as any finite power of $\log\log n$.
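
The claim that any finite power of $\log n$ works in the mildly ill-posed case is asymptotic, and the following sketch shows how late the asymptotics can set in. It assumes a particular reading of the truncated-prior condition, obtained by combining (4.4) with polynomial rates $\eta_{q_n} \sim q_n^{-v}$, $B_n^2 \sim n^b$ and a rate $n^{-\rho}$ for the cell term: the ratio $q_n^{2+2av}n^{b-\rho}$ must tend to zero. This reading, the function name and the numerical values are assumptions for illustration. The ratio is computed on the log scale, with $\log n$ as the input, to avoid overflow.

```python
import math


def log_condition_ratio(log_n, c, a, v, rho, b):
    """log of q^(2 + 2av) * n^(b - rho) when q = (log n)^c.

    Assumption: this ratio is the quantity that must vanish in the
    mildly ill-posed, truncated-prior case.
    """
    log_q = c * math.log(log_n)
    return (2.0 + 2.0 * a * v) * log_q - (rho - b) * log_n


# Hypothetical values: q = (log n)^3, a = v = 1, rho = 0.1, b = 0.
for log_n in (20.0, 1.0e3, 1.0e4):
    print(log_n, log_condition_ratio(log_n, 3.0, 1.0, 1.0, 0.1, 0.0))
```

With these values the log-ratio is still positive at $\log n = 20$ (roughly $n = 5\times 10^8$) and only becomes negative for astronomically large $n$, so the rate conditions should be read as asymptotic guidance rather than finite-sample prescriptions.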

In addition, we will indicate in the next section that the current Bayesian- 
flavored treatment can even allow a data-driven choice of the sieve dimen- 
sion q n , using a posterior distribution derived from a mixed prior. 

5. Random sieve dimension. As the sieve dimension $q_n$ plays an important role not only in dealing with the ill-posed inverse problem, but also in many applied sieve estimation methods, in this section we briefly discuss the possibility of choosing it based on a posterior distribution. This requires first specifying a prior distribution on the sieve dimension. Since the conditions on a deterministic $q_n$ for consistency only restrict its growth rate, $Mq_n$ also leads to consistency for a constant $M > 1$ whenever $q_n$ does.

We denote the sieve dimension by $q$, let it be random and place a discrete uniform prior

$$\pi(q) = \mathrm{Unif}\{1, \ldots, Mq_n\} \tag{5.1}$$

for some deterministic sequence $q_n \to \infty$ and constant $M > 1$. Then the prior on the sieve coefficients $b$ becomes a mixture prior

$$\pi(b) = \sum_{q=1}^{Mq_n}\pi(q)\pi(b\mid q) = (Mq_n)^{-1}\sum_{q=1}^{Mq_n}\pi(b\mid q), \tag{5.2}$$

where $\pi(b\mid q)$ follows a prior as specified before for a given sieve dimension $q$. The feasible limited information likelihood is, as before, denoted by $L_n(b, q)$. We have the joint posterior

$$p(g_b, q\mid X^n) \propto \pi(b\mid q)L_n(b, q).$$

It can be shown that the uniform mixture prior can also lead to the posterior consistency.

Theorem 5.1 (Random $q$). For each theorem in Sections 3 and 4, suppose the corresponding conditions are satisfied for the deterministic sieve dimension $Mq_n$ instead of $q_n$, for some $M > 1$. Then all the posterior consistency results stated in Sections 3 and 4 (on risk consistency and on estimation consistency) remain valid for the mixed prior (5.2) with random $q$ following prior (5.1), with no extra conditions, with the following two exceptions:

(1) We additionally assume that $(\log q_n)/n = o(\delta_n)$ for the statement of Theorem 3.2 to hold.

(2) We additionally assume that $(\log q_n)/n = o(\inf_{g\in\mathcal{H}_n,\ g\notin\Theta_I^{\epsilon}} G(g))$ for the statement of Theorem 3.4 to hold.

Note that the uniform prior is used for $q$, which gives zero prior probability to values of $q$ beyond $Mq_n$. However, from a technical point of view, the result can be extended to the case where the tails of the prior on $q$ extend to infinity, as long as the tail is thin enough so that $\pi(q > Mq_n)$ is dominated by a small enough upper bound.

The marginal posterior of $q$ is given by

$$p(q\mid X^n) \propto \int \pi(b\mid q)L_n(b, q)\,db. \tag{5.3}$$

Practically, we can choose $q$ from $p(q\mid X^n)$.
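
As a minimal sketch of the last step, suppose the integrals in (5.3) have already been approximated for each $q = 1, \ldots, Mq_n$ by some numerical method and stored on the log scale. Under the uniform prior (5.1), the marginal posterior is then obtained by normalizing, which is best done with a log-sum-exp shift to avoid underflow. The function name and the numbers below are hypothetical and serve only as an illustration.

```python
import math


def posterior_over_q(log_marginals):
    """Normalize exp(log_marginals) over q = 1, ..., len(log_marginals).

    Assumption: log_marginals[q - 1] approximates the log of the integral
    of pi(b | q) * L_n(b, q) over b, and the prior on q is uniform.
    """
    top = max(log_marginals)
    weights = [math.exp(value - top) for value in log_marginals]  # log-sum-exp shift
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical log marginal values for q = 1, 2, 3, 4.
log_m = [-1050.0, -1000.0, -1001.0, -1010.0]
probs = posterior_over_q(log_m)
best_q = 1 + probs.index(max(probs))   # posterior mode of q
print(probs, best_q)
```

One may then report the posterior mode of $q$, or draw $q$ from these probabilities and sample $b$ given $q$, depending on how the posterior is to be summarized.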

6. Conclusion and discussion. We studied the nonparametric conditional 
moment restricted model in a quasi-Bayesian approach, with a special focus 
on the large sample frequentist properties of the posterior distribution. There 
was no distribution assumed on the data generating process. Instead, we de- 
rived the posterior using the limited information likelihood (LIL), allowing 
the proposed procedure to be simpler than the traditional nonparametric 
Bayesian approach, which would model the data distribution nonparametrically. There are several alternative moment-condition-based likelihood functions. The empirical likelihood [Owen (1990)] and the generalized empirical likelihood [Imbens, Spady and Johnson (1998), Newey and Smith (2004) and Kitamura (2006)] are typical examples. It is still possible to establish the posterior consistency if these alternative nonparametric likelihoods are used, which is left as a future research direction.

The parameter space $\mathcal{H}$ does not need to be compact. We approximate $\mathcal{H}$ using a finite-dimensional sieve space $\mathcal{H}_n$, and the regularization is carried out by a slowly growing sieve dimension $q_n$. We then studied in detail the NPIV model and verified all the sufficient conditions proposed in Section 3 in order for the posterior to be consistent.

It is also possible to achieve the posterior consistency using a larger sieve 
dimension q n . In this case, the regularization is carried out by a truncated 
normal prior with shrinking variance, and the log-prior is then a regular- 
ization penalty attached to the log-likelihood. Conditions (3.10), (3.11) and 
Assumption 4.5 can be relaxed. We describe this procedure in the Technical 
Report [Liao and Jiang (2011b)].

An interesting research direction is to derive the convergence rate. With
all the tools given in this paper, it is possible to obtain the rate of conver- 
gence of our procedure. However, the rate would be sub-optimal, possibly 
due to the technical bound (2.1) used in this paper. It would be interest- 
ing to develop a method based on a bound tighter than (2.1), in order to 
prove the nonparametric minimax optimal rate of convergence as in Chen 
and Pouzo (2009b). 

In applications, our method requires a priori choices of $(k_n, q_n)$, and of $B_n$ for the truncated prior. We conjecture that the finite sample behavior of the posterior is robust to the choice of $(k_n, B_n)$. However, it should be sensitive to $q_n$, as a large value of $q_n$ may lead to over-fitting. Therefore, we proposed an approach that allows for a random sieve dimension by putting a discrete uniform prior on it and selecting it from its posterior. With the upper bound $Mq_n$ of the uniform prior growing under the same rate restriction as before, the posterior consistency is also achieved. This feature, however, requires specifying $Mq_n$. In practice, one may start with a moderate level of $Mq_n$ that is less than ten. In the NPIV setting, Horowitz (2010) recently introduced an empirical approach for selecting $q_n$. Moreover, developing methods of selecting $(k_n, B_n)$ in a Bayesian (or quasi-Bayesian) approach is another important research topic.

Technical proofs (DOI: 10.1214/11-AOS930SUPP; .pdf). This supplementary material contains the proofs of all the results developed in the main